Introduction

For amyloid to form, Aβ must be nucleated into aggregated Aβ seeds. Fibril formation is characterized by this slow nucleation step, proceeded by a fast growth phase. Recent studies show the presence of certain proteins at the center of Aβ plaques, most notably FAM222A and MBP have been discovered at the core of plaques. These studies suggest there are proteins that could potentially promote primary nucleation. Nucleation can be promoted by several other factors such as positively charged metal ions (Cu2+) and low pH levels. However, only a handful studies have examined protein-based nucleation. Here, we look at the protein composition of these plaques and identify potential nucleators, proteins that promote Aβ42 nucleation. We define several basic conditions for potential nuceators: (1) A nucleator must have a attractive charge interaction with Aβ42 that is balanced with the repulsive self-interaction of Aβ42; (2) There is sufficient attractive potential between the nucleator and Aβ42 to accommodate multiple strands of Aβ42; (3) There is specific-interaction potential with the N-terminal end of Aβ42. We analyzed these conditions by evaluating the charge-composition of these plaque proteins with pH ranging from the theoretical isoelectric point of Aβ42, 5.31, to standard physiological pH, 7.4. Here, we determine a list of nucleator candidates and examine biologically known associations of these proteins with APP-related proteins via STRING network analysis. In a separate part, we follow this by examining differential expression in both bulk and single-cell RNAseq.

Preprocessing

A protein composition list was obtained from Drummond et al.(see link at bottom). The proteomics data set was derived using label-free quantitative LC–MS/MS. The list consist of 279 proteins that were consistently found in the plaques of 30 AD individuals. They provide intensities, the relative expressions in the 30 individuals and the UniProt IDs for the respective proteins. The original intent of the study was to differentiate the composition of plaques between sAD and rpAD patients. However, we use all of the proteins that are found in all individuals. Protein sequences were obtained using the given UniProt IDs. Only 1 of the 279 proteins were unidentified in the UniProt database and removed from the dataset. Furthermore, two additional sequences were added to the dataset: Aβ42 and FAM222A. FAM222A has only been recently discovered via GWAS on AD patients and hasn’t even been fully characterized to date. However, a recent experimental study has suggested that it is found in the core of Aβ aggregates and exacerbates Aβ aggregation in AD mouse model under intracerebroventricular infusion (forced expression).

Selecting pKa Scale for Net Charge Prediction

We start the investigation by calculating sequence-based theoretical net charge of our proteins. The first is the theoretical net charge as predicted using the pKa scale derived by Bjellqvist. Specifically, we used the R package Peptides to perform this calcuation for the whole dataset. The Bjellqvist scale was chosen due to the fact that it showed optimal performance in a benchmark study of isoelectric point (pI) prediction. This pKa scaled showed to give the lowest RMSD values, second to an SVM method, when predicting pI. It also showed to be the have the most consistent RMSD score for various types of proteins. Out of all the pKa scales, it one of the most established and heavily used pKa scales. It also showed the best performance when predicting the theoretical isoelectric point of Aβ42. The plot shows the pI predictions of each of the available scales as well as error bars with the peptide RMSD values calculated in the benchmark study. The dashed line shows the pH 5.5 at which optimal Aβ42 aggregation occurs. The solid line is theoretical pI prediceted by the Bjellqvist scales. As can be seen, the EMBOSS and Sillero scales predict the pI almost exactly at this point. This could, in fact, suggest that Aβ42 optimally self-aggregates at its isoelectric. However, pKa scales only give rough approximations of the net charge and the predictions given by any scale should not be interpreted with heavy weight. For the purposes of this study, all of the scales give the same results when selecting candidate nucleators. Considering these reasons, the Bjellqvist scale was selected to due to its accuracy, consistency and previous use in another relevant study.

Residue Charges of Aβ42

First, the theoretical charge was calculated for each residue of the Aβ42 polypeptide over for pHs ranging from 5.3 to 7.4. The four most pH responsive residues in Aβ42 are the N-terminal aspartic acid (D) and the three histidine (H) residiues close within the N-terminal half of the chain. The other negatively charged residues consist of 2 aspartic acids (D), 3 glutamic acids (E) and the C-terminal alanine (A). The positively charged residues consist of 1 arginine (R) and two 2 lysines (A). As a results, the net charge of Aβ42 shifts from -3.34 to 0 over pHs 7.4 to 5.3. This small shift in charge may seem insignificant at first. When considering the abundance and size of AB42, this shift in charge is quite significant.

Histidine Abundance

Another interesting feature of AB42 is the number of histidine residues located near the N-terminal end of the chain. The experimental study ofFAM222A nucleation noted that FAM222A specifically binds to the N-terminal end of Aβ42. Histidineis important for maintenance of myelin sheaths that protect nerve cells and is metabolized to the neurotransmitter histamine.

“PULLED FROM LITERATURE” The molecular interactions of histidine with other amino acids and metallic cations in proteins can be classified into the following five types:

  1. Cation-π interaction. The side chain imidazole of His is an aromatic ring. Histidine can take part in the cation-π interactions as the aromatic motif with metallic cations or organic cations (protonated amino acids, Lys+ and Arg+). On the other hand, the protonated His+ is an organic cation, which can join the cation-π interactions as an organic cation with other aromatic amino acids (Phe, Tyr, and Trp).

  2. π-π stacking interaction. The imidazole structure of histidine side chain is a conjugative π-plane, which can make π-π stacking interactions with the aromatic side chains of other amino acids (Phe, Tyr, and Trp).

  3. hydrogen-π interaction. The polar hydrogen atom of histidine can form hydrogen-π bond with other aromatic amino acids in ‘T’ orientation.

  4. Coordinate bond interaction. The basic nitrogen atom in the imidazole of histidine has a lone electron pair that make it a coordinate ligand of metallic cations, such as Zn2+, Ca2+ and Cu2+.

(5)Hydrogen bond interaction. The polar hydrogen atom of the imidazole is a hydrogen-bond donor, and the basic nitrogen atom is a hydrogen-bond acceptor. “PULLED FROM LITERATURE”

There are 3 phenylalanines (F) and 1 tyrosine (Y) located within Aβ42. Intermolecular hydrogen-π interactions form within Aβ42 or intramolecular hydrogen-π interactions with other chains seems insignificant in nucletion considering Aβ42 is mostly amorphously collapsed and such interactions would require specific orientations that would rarely occur. This rules out interaction types (2) & (3).

The binding Aβ42 to multivalent cationic metals via histidine is well known as mentioned before but would only apply to metal-Aβ42 interactions. This rules out interaction type (4).

The interaction types that would apply to protein-based nucleation are mostly-likely to be cation-π interactions and hydrogen bond interactions.

Charge-based Descriptors

Outside the realm of histidine, charge-to-charge interactions have the longest range coompared to any other type. A strong nucleator and Aβ42 should be attracted to each other of long distances (at 10-20A). Therefore, the charge composition of Aβ42 and other proteins should play a critical in the role nucleation of Aβ42 seeds. For this reason, we will classify the plaque proteins and quantify interactions with Aβ42 using charge-based descriptors.

Using the Bjellqvist pKa scale, the theoretical net charge was computed for all 280 proteins for pH ranging from 5.3 to 7.4 with 0.01 intervals. Another important parameter to consider is the amino acid (AA) length of each protein. The net charge of each protein is normalized by its AA length, giving the linear charge density. The linear charge density is a common variable used polyelectroyte theory to describe polyelectrolyte aggregation and coarse-grained several studies have focused on its impact on generic polyelectrolyte aggregation. Here, we use it as a measure for pH-responsiveness and as a descriptor for unsupervised clustering of our proteins.

UMAP Transform for Clustering

The Uniform Manifold Approximation and Projection (UMAP) algorithm as implemented in the uwot R package was used to transform the charge density feature space. The reason for using UMAP transform was to store the plaque protein feature space in a way that could be applied to larger list of proteins for downstream analysis. Furthermore, making use of the UMAP embedding space makes clustering less ambiguous and aids in the repeatability in downstream analysis, especially when using HDBSCAN. The goal of clustering proteins in this part of the analysis is to simply filter our proteins into broad groups so that we can identify potential nucleator candidates and not to groupthem discretely based on strict relationships such as homology or secondary structure.

The charge densities were transformed into 8 UMAP components using a Euclidean distance metric. The reason for selecting 8 components was to accommodate the optimum feature space rule which is defined by:

Number of Samples = 2^(Number of Features) -> Number of Features = log2(Number of Samples)

Since we have 280 proteins we should use ~8 features (i.e. UMAP components) for clustering. Selected parameters include 8 number of neighbors for the KNN graph and a negative sample rate of 8. Since the UMAP embedding output is dependent on the selected random seed. This transform was performed 10 separate times using 10 different random seeds. A Euclidean distance matrix was computed for each UMAP embedding to provide a averaged distance matrix that could be used for HDBSCAN clustering. Each of the 10 UMAP transforms were stored for later application.

Clustering with HDBSCAN

Hierarchical Density-based Spatial Clustering of Applications with Noise (HDBSCAN) was used for clustering the plaque proteins UMAP transform space. The purpose of using HDSCAN is too identify strong clusters and too disclude proteins that do not easily fit a cluster profile. At this point in the analysis, it is not needed effectively cluster such a small dataset of 280 proteins. However, it plays an important role in the downstream analysis when we’re analyzing ~10,000-20,000 proteins. The minimum size of the clusters was selected to be 8, considering we are using only 8 UMAP components. Here, only 3 proteins were unsuccessfully clustered (colored in gray). The HDBSCAN dendrogram is shown below with the corresponding random cluster colors for as well as cluster colors corresponding to charge density ranging from red to blue for negative and positive charges, respectively.

Cluster Visualization

We can visualize these clusters of 2 component UMAP embedding of the charge densities.As can be seen, the HDSCAN protein clustering of the 8-component embedding visually agrees with the 2-component embedding.

The plot below shows the 2-component UMAP embedding of the cluster. The clusters are labeled by number and colored randomly for easy visual identification.

The plot below is same as the above plot. However, the clusters are colored by their average charge density. As can be seen, cluster 2 has the highest negative charge density, followed by clusters 15 & 8. Clusters 3, 4, 5, 7, 9, 10 12 & 14 have weaker net charge densities. Clusters 1, 11 & 13 are all highly negative charged with 13 being the highest. Interestingly, both FAM222A and MBP fall into Cluster 2 and Aβ42 falls in to Cluster 1. These clusters are significantly oppositely charged.

Change of Charge vs. pH

With our proteins clustered, it’s easier to spot general trends in the charge response to pH. We’ll look at the change of net charge, charge density and the gradient of charge density of all the plque proteins.

This first plot shows the change of the net charge with respect to pH. The black dashed line is the charge of Aβ42. The names of the Cluster 2 proteins are labeled on the side of the plot with corresponding to to their net charge at a pH of 7.4. The net charge of Aβ42 is extremely small compared to a lot of the other polypeptides. This is due to the fact that Aβ42 is by far the smallest chain in the plaque proteins. This is why it’s better to examine charge after its been normalized by the AA length, i.e. the charge density.

This plot similar to before but uses the charge density in place of net charge. As can be seen, Aβ42 has one of the highest negative charge densities at pH 7.4. Interestingly, FAM22A has the steepest response to pH. It’s charge density goes from -0.079 to 0 over as the pH is lowered, the widest range out of any other polypeptide. The This is why charge density was used for clustering. The clustering also shows that we’re grouping proteins even over large pH differences. For example, HISTH4A and HIST1H2BC are much highly than the rest of Cluster 2. For our purposes this grouping is appropriate because we’re trying to identify all of these possible highly positively charged candidates.

If we plot the gradient of the charge density with respect to pH, we can clearly see that Aβ42 is the most pH responsive almost over the whole range, up to a pH ~5.5. This the next steepest gradient is that of HBA1 which is identified in Cluster 2. There may be some interesting relationship between Aβ42 strands and HBA1. However, it is difficult to make any assumptions. As a side note, HBA1 is also relatively small chain with a high histidine composition and this is probably why it has such a steep response to pH.

This can be taken step further and we can look at the curvature response to pH. In this, We can see two intersections with zero, at pH ~5.5 and ~5.8. This means that the pH response is nearly linear within this range. This is also the range at which optimum aggregation occurs, suggesting something intersting might be occurring with histidine residues at this point. It’s too difficult interpret anything meaningful since the net charge calculations are only rough estimates.

Interaction Potential with Abeta 42

Next, we estimated the electrostatic potential between Aβ42 and our other proteins.First, let’s consider some our limitations. The coloumbic potential energy between any two points can be described below:

E = kq1q2/r

Since we want to describe the energetic potential disregarding distance between them, we can disclude the r and since k is just a constant we can rewrite our expression purely using charge that is directly proportional to the electrostatic energy/force.

Ecol ~ E = q1q2

Next, we want to describe the total electrostatic potential between all of the residues in two proteins. This would be the product of the net charges of the two proteins:

Net Charge: Qi = sum(qi) OR Qj = sum(qj)

Etot = sumi(sumj(qiqj)) = sum(qi)sum(qj) = QiQj

Finally, we want to normalize the charge interaction by the AA lengths of the two proteins:

Enorm = QiQj/(Ni+Nj)

This is how we score the electrostatic potential between Aβ42 and the other proteins.

Interaction Potential with Abeta 42

We can calculate the average interaction potential score over the whole pH range, 5.3-7.4. This is shown in the box plot below. Each box represents a cluster and the dots are individual proteins. The positive horizontal dashed line is the repulsive self-interaction potential score of Aβ42. The negative horizontal dotted line is the opposite Aβ42 self-interaction potential. In theory, any proteins with an interaction potential around this opposite self-interaction potential would be sufficient in overcoming the electrostatic repulsion between two Aβ42 strands. As can be seen, all of the proteins in Cluster 2 have significantly greater attractive potential than any other cluster. This isn’t surprising considering that Cluster 2 has the most negative charge density compared to anybody else. Furthermore, it is the only cluster that overlaps with the pposite self-interaction potential of Aβ42.

Interaction Potential with Abeta 42 over pH Range

We can also show the interaction potential over the whole range of pH values. The plot below shows the self-interaction of Aβ42 (black dashed line) goes to zero at it’s isoelectric point of 5.31. However, it appears to have significantly dropped around 5.5 and it could be that the repulsive electrostatic force between Aβ42 strands becomes insignificant at this level, in which the thermal energy of the system could easily be sufficient to overcome any energetic barriers. This suggests that pH itself is, in fact, a strong nucleator when looking in the context of charge-charge interactions. However, the self-interaction potential of Aβ42 is still quite high in less acidic pH conditions. This suggests that other factors should play an important in the nucleation of Aβ42 in less acidic conditions. This has been shown with presence divalent metal ions and seen in certain cases of proteins, such as FAM222A and MBP (specifically it’s degraded form). In theory, a strong nucleator should have an attractive interaction with Aβ42 that is balanced with the repulsive self-interaction of Aβ42. This can be described in the equation below, where subscript a represents Aβ42 and b denotes any other polypeptide:

Etot = Eab + Eaa = 0

Eij = -Eaa, where Eaa is always positive

If Eij > 0, there is a net repulsive electrostatic force with Aβ42 and the two strands are unlikely to directly interact as well as form a complex that that even less likely to draw more Aβ42. If Eab << -Eaa, Aβ42 will interact strongly with the protein of interest but would discourage the self-association Aβ42 strands. Ideally, the net interaction potential of nucleator-Aβ42 association is close enough to overcome Aβ42-Aβ42 charge repulsion. Therefore, ideal candidates will have an association (Eab) that is equally opposite to Aβ42-Aβ42 charge repulsion (-Eaai) and the root mean square difference (RMSD) between the two potentials should be signicantly small. In fact, it is best compute this difference over a less acidic pH range. The distance should computed over a range instead of a single pH considering that a strong nucleator should effective under a range of conditions. Furthermore, there are limitations to these charge estimates and higher sampling should be considered. Consequently, the RMSD was calculated for the upper half of the range, from 6.35 to 7.4. To determine the potentials that were significantly closer, to the ideal association potential, the log10 was taken of the RMSDs and a Z-Score was taken to determine the outliers. Any protein with a Z-Score less than -2 was considered an ideal nucleator with respect to electrostatic interactions.

RMSD = sqrt((Eab - (-Eaa))^2) = sqrt((Eab + Eaa)^2) logRMSD = log10(RMSD) Z-Score = (logRMSD - mean(logRMSD))/sd(logRMSD)

The gene names of all the proteins plotted with their respective Z-Scores. All of the selected ideal nucleators are highlighted in blue the text plot below. The names are also highlighted in the plot above. Interestingly, we find our already experimentally determined nucleators (FAM222A and MBP) along with other interesting candidates such as Tubulin Polymerization Promoting Protein (TPPP).

Next, a strong nucleator should have a sufficient amount of net charge to accommodate multiple Aβ42 strands. Therefore, we use the un-normalized potential Etot. The red lines show the self-interaction potential of Aβ42 factored by a negative multiplier. In theory, the number of Aβ42 chains a protein can accommodate is equal or greater than the absolute factor of the self-interaction potential in which the nucleator-Aβ42 is less than or equal to. For instance, it is estimated that SYN1 can accommodate approximately ~6.5 Aβ42 strands at a pH of 7.4 while COX4I1 can accommodate slightly more than 2. This shows that all of the candidates can at least accomdate approximately 2 strands. However, several studies have shown that the net charge of polyelectrolyte complexes can carry excess charge. Therefore, these estimations could possibly underestimating the number of Aβ42 strands that can be accommodated.

STRING Network Analysis with Cluster 2 & APP

Now, we examine the macroscopic biological associations within Cluster 2 and the amyloid precursor protein (APP) by using Search Tool for the Retrieval of Interacting Genes/Proteins (STRING). First, we search full STRING networks for each of the 18 proteins using the default parameters: a required score of 0.400 (medium confidence) and an FDR stringency of 5% (medium). After this, we concatenate all of the genes in each local network together and perform a Multiple Protein search with this concatenated list. The full final network was imported with a 0 confidence cutoff.

With the full combined network, an adjaceny matrix was created using only the experimental interaction and coexpression scores, ranging from 0 to 1. Consequently, an weighted undirected graph was created using the R package igraph. All of the network proteins were clustered using the Leiden algorithm with a resolution parameter equal to 1.

The resulting graph is shown below with all of the labeled Cluster 2 proteins and APP. There were four clusters determined by the algorithm. As can be seen, HBA1 falls in its own cluster (green) and the histones (HIST1H2BC & HIST1H4F) fall in their own cluster (yellow) along with EEF1A1. The red cluster is composed of mostly Krebs Cycle-associated enzymes HADHA, HADHB, GOT2 and ACOT7. The turqoise cluster is composed of inner mitochondrial membrane proteins COX4I1, PHB2 and SLC25A6. The most interesting cluster (blue) contains the neuro-specific proteins of APP, TPPP, CNP, SYN1, MBP and FAM222A. This cluster also contains both of the experimetally determined nucleators of MBP and FAM222A. However, FAM222A appears to be the most disconnected out the others in the network. This is to be expected since very little is known is about FAM222A. All of the Cluster 2 proteins in this network were also identified as strong nucleator candidates as well.

Next, we can examine the most connected proteins in the network. To do this, we can calculate the betweenness centrality of each protein in the network over a range of cutoffs. The most central protein is GAPDH. This is to be expected considering that GAPDH is a “house-keeping” protein and is abundant in all cells. GAPDH was also determined to be one of the most abundant proteins in the amyloid plaques as well.

We can look at the connections between GAPDH and all of the other proteins. As can be seen, GAPDH is directly connected to all of the clusters. GAPDH and its connections are highlighted in red.

However, we’re more interested in the proteins that are directly connected to Alzheimer’s. For this, we can look at the connections between three of the mosthighly Alzheimer’s-associated genes APP, BACE1 and APOE. APP and its connections are highlighted in red within the graph below. Both, TPPP and MBP are directly connected to APP.

Here, BACE1 and its connections are highlighted in red within the graph below. Both, SYN1 and MBP are directly connected to BACE1.

Here, APOE and its connections are highlighted in red within the graph below. Only SYN1 is directly connected to APOE.